Domain adaptation for supervised integration of scRNA-seq data

Large-scale scRNA-seq studies typically generate data in batches, which often induce nontrivial batch effects that need to be corrected. Given the global efforts to build cell atlases and the growing number of annotated scRNA-seq datasets, we propose a supervised strategy for scRNA-seq data integration called SIDA (Supervised Integration using Domain Adaptation), which uses cell type annotations to guide the integration of diverse batches. The strategy is based on domain adaptation, which was initially proposed in the computer vision field. We demonstrate that SIDA is able to generate comprehensive reference datasets that lead to improved accuracy in automated cell type mapping analyses.


Supplementary Note 2: Comparison of computational efficiency
We compared the computational efficiency of SIDA, four unsupervised algorithms (Seurat, Harmony, Limma, scAlign), and two supervised algorithms (scAlign+ and LAmbDA) across three data collections (pancreas, PBMC and gut). The integration algorithms that rely on deep learning were run on an NVIDIA Quadro P2200 GPU. Table S2 shows that the algorithms that do not use deep learning (Seurat, Harmony and Limma) are computationally much cheaper than the four deep-learning-based algorithms (SIDA, scAlign, scAlign+, and LAmbDA). Among all the algorithms, SIDA achieves the best integration performance and has the longest computing time, which reflects a trade-off between performance and computational cost. We suggest the following for choosing an integration method. When integrating newly generated datasets that have not yet been analyzed and lack cell type labels, unsupervised integration algorithms, such as Seurat and Harmony, must be used. When integrating previously analyzed datasets with cell type labels available, we believe supervised integration is the better choice, and we would prefer our method, SIDA. Although Table S2 shows that SIDA requires a longer computing time than the other deep-learning-based supervised integration algorithms, Table 1 and Figure 3c demonstrate that SIDA significantly outperformed the other methods in the gut data collection, which is the most challenging data collection analyzed in this study.
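A comparison of computing time of the kind described above can be sketched with a small timing harness. This is a minimal illustration only, not the procedure used to produce Table S2: the function names (time_method, benchmark) and the two toy methods are hypothetical stand-ins for actual calls to the integration algorithms, and wall-clock time is assumed to be the quantity of interest.

```python
import time
from statistics import median


def time_method(method, data, repeats=3):
    """Return the median wall-clock seconds of `method(data)` over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        method(data)
        durations.append(time.perf_counter() - start)
    return median(durations)


def benchmark(methods, collections, repeats=3):
    """Time every method on every data collection.

    `methods` maps a method name to a callable and `collections` maps a
    collection name to its input data. Both are placeholders in this sketch.
    """
    return {
        (method_name, collection_name): time_method(method, data, repeats)
        for method_name, method in methods.items()
        for collection_name, data in collections.items()
    }


# Hypothetical stand-ins: real runs would call Seurat, Harmony, SIDA, etc.
def cheap_method(data):
    return sorted(data)


def costly_method(data):
    return [sorted(data) for _ in range(50)]


if __name__ == "__main__":
    toy_collections = {"toy": list(range(10_000, 0, -1))}
    toy_methods = {"cheap": cheap_method, "costly": costly_method}
    for key, seconds in benchmark(toy_methods, toy_collections).items():
        print(key, f"{seconds:.4f} s")
```

Taking the median over a few repeats reduces the influence of one-off delays such as data loading or cache warm-up. For GPU-based methods, timings are only meaningful if any asynchronous GPU work has finished before the clock is stopped, which this sketch does not handle.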

Supplementary Note 3: UMAP visualization of integrated data
In Figure 2 of the main text, we provided tSNE visualizations of the embedding space generated by SIDA, four unsupervised algorithms (Seurat, Harmony, Limma, scAlign) and two supervised algorithms (scAlign+ and LAmbDA), across three data collections (pancreas, PBMC and gut). In Figures 4 and 5 of the main text, we provided tSNE visualizations of the embedding space generated by SIDA, scAlign and scAlign+ based on two data collections (pancreatic islet and HSC). In this supplemental section, we provide the corresponding UMAP visualizations of these analyses, in Figures S1, S2 and S3, respectively.
For the pancreas data collection, the integration results are shown in the UMAP visualizations in Figure S1(a-b), colored by cell type and batch labels, respectively. Seurat and Harmony successfully mixed the different batches, as shown in the 2nd and 3rd columns of Figure S1(b). However, when colored by cell type labels, the 2nd and 3rd columns of Figure S1(a) show that Seurat and Harmony improperly aligned some distinct cell types across batches, e.g. stellate with mesenchymal and acinar with ductal. In the 4th-6th columns of Figure S1(a-b), we observe that Limma, scAlign and scAlign+ performed poorly: the same cell type in different batches did not align or mix. The last column of Figure S1(b) shows that LAmbDA successfully aggregated and mixed the different batches. However, the last column of Figure S1(a) shows that LAmbDA failed to separate different cell types properly. The 1st column of Figure S1(a-b) shows that SIDA achieved better cell type separation and batch mixing than the four unsupervised and the two supervised methods.
For the PBMC data collection, the integration results are shown in the UMAP visualizations in Figure S1(c-d), colored by cell type and batch labels. The 1st column of Figure S1(c-d) shows that SIDA performed well on the PBMC data collection, achieving proper mixing of the different batches. Based on the 2nd and 3rd columns of Figure S1(c-d), Seurat and Harmony mixed the different batches, but improperly aligned two similar cell types: CD4 T and CD8 T. Based on the 4th-6th columns of Figure S1(c-d), we observe that Limma, scAlign and scAlign+ failed to properly integrate the PBMC data collection, which is consistent with their performance on the pancreas data collection. The last column of Figure S1(c-d) shows that LAmbDA did not separate different cell types properly.
For the gut data collection, the integration results are shown in the UMAP visualizations in Figure S1(e-f), colored by cell type and batch labels. As shown in the 2nd-6th columns of Figure S1(e-f), Seurat, Harmony, Limma, scAlign and scAlign+ did not effectively mix the batches, and did not properly align corresponding cell types in different batches. The last column of Figure S1(e-f) shows that LAmbDA successfully mixed the four different batches, but improperly aligned different cell types.
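The two properties assessed visually above, cell type separation and batch mixing, can also be summarized numerically from an embedding. The following is a minimal sketch under our own assumptions and is not the evaluation metric reported in this study: the function names, the choice of k nearest neighbors, and the toy coordinates are all hypothetical. For each cell it computes the fraction of neighbors sharing its cell type (purity) and the normalized entropy of batch labels among its neighbors (mixing).

```python
import math
from collections import Counter


def nearest_neighbors(points, index, k):
    """Indices of the k points closest (Euclidean) to points[index], excluding itself."""
    distances = [
        (math.dist(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]


def neighborhood_scores(points, cell_types, batches, k=3):
    """Return (cell type purity, batch-mixing entropy), each averaged over cells.

    Purity: fraction of a cell's k neighbors sharing its cell type (1.0 = well separated).
    Entropy: Shannon entropy of batch labels among the neighbors, divided by
    log(number of batches), so 1.0 = evenly mixed and 0.0 = a single batch.
    Assumes k is smaller than the number of cells.
    """
    n_batches = len(set(batches))
    max_entropy = math.log(n_batches) if n_batches > 1 else 1.0
    purity_sum = 0.0
    entropy_sum = 0.0
    for i in range(len(points)):
        neighbors = nearest_neighbors(points, i, k)
        same_type = sum(cell_types[j] == cell_types[i] for j in neighbors)
        purity_sum += same_type / len(neighbors)
        counts = Counter(batches[j] for j in neighbors)
        entropy = -sum(
            (c / len(neighbors)) * math.log(c / len(neighbors))
            for c in counts.values()
        )
        entropy_sum += entropy / max_entropy
    return purity_sum / len(points), entropy_sum / len(points)


if __name__ == "__main__":
    # Toy 2D embedding: two cell types, each forming one cluster with both batches.
    toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
                  (10, 10), (10, 11), (11, 10), (11, 11)]
    toy_types = ["A", "A", "A", "A", "B", "B", "B", "B"]
    toy_batches = [1, 2, 1, 2, 1, 2, 1, 2]
    purity, mixing = neighborhood_scores(toy_points, toy_types, toy_batches)
    print(f"cell type purity: {purity:.2f}, batch mixing: {mixing:.2f}")
```

Under this reading, an embedding with high purity and high mixing corresponds to the pattern described for SIDA, while high mixing with low purity corresponds to distinct cell types being improperly aligned. The brute-force neighbor search is quadratic in the number of cells and is only suitable for small illustrations.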